33 research outputs found

    Few-shot Semantic Segmentation with Self-supervision from Pseudo-classes

    Despite the success of deep learning methods for semantic segmentation, few-shot semantic segmentation remains a challenging task due to the limited training data and the generalisation requirement for unseen classes. While recent progress has been particularly encouraging, we discover that existing methods tend to have poor performance in terms of meanIoU when query images contain other semantic classes besides the target class. To address this issue, we propose a novel self-supervised task that generates random pseudo-classes in the background of the query images, providing extra training data that would otherwise be unavailable when predicting individual target classes. To that end, we adopted superpixel segmentation for generating the pseudo-classes. With this extra supervision, we improved the meanIoU performance of the state-of-the-art method by 2.5% and 5.1% on the one-shot tasks, as well as 6.7% and 4.4% on the five-shot tasks, on the PASCAL-5i and COCO benchmarks, respectively
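The pseudo-class idea in the abstract above can be illustrated with a small sketch. It assumes a superpixel label map has already been computed by some superpixel algorithm, and it treats any superpixel that overlaps the target mask as off-limits. The function name `pseudo_class_map`, the grid representation and the exclusion rule are assumptions made for this example, not the authors' implementation.

```python
import random
from typing import List, Optional

Grid = List[List[int]]


def pseudo_class_map(superpixels: Grid, target_mask: Grid,
                     num_pseudo: int, seed: Optional[int] = None) -> Grid:
    """Label a random subset of background superpixels as pseudo-classes.

    Returns a grid where chosen superpixels carry labels 1..num_pseudo and
    every other pixel (including the target region) is 0.
    """
    rng = random.Random(seed)
    # Assumption: a superpixel that overlaps the target at all is excluded.
    touching = {
        sp
        for sp_row, mask_row in zip(superpixels, target_mask)
        for sp, m in zip(sp_row, mask_row)
        if m
    }
    all_ids = {sp for row in superpixels for sp in row}
    background = sorted(all_ids - touching)
    # Randomly pick background superpixels to act as pseudo-classes.
    chosen = rng.sample(background, min(num_pseudo, len(background)))
    label_of = {sp: i + 1 for i, sp in enumerate(chosen)}
    return [[label_of.get(sp, 0) for sp in row] for row in superpixels]


# Hypothetical 4x4 image split into four 2x2 superpixels; the target sits in superpixel 0.
superpixels = [[0, 0, 1, 1],
               [0, 0, 1, 1],
               [2, 2, 3, 3],
               [2, 2, 3, 3]]
target_mask = [[1, 1, 0, 0],
               [1, 0, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]
pseudo = pseudo_class_map(superpixels, target_mask, num_pseudo=2, seed=0)
```

Each non-zero label in `pseudo` could then serve as an extra segmentation target during training, alongside the real target class.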

    Two-View Geometry Scoring Without Correspondences

    Camera pose estimation for two-view geometry traditionally relies on RANSAC. Normally, a multitude of image correspondences leads to a pool of proposed hypotheses, which are then scored to find a winning model. The inlier count is generally regarded as a reliable indicator of 'consensus'. We examine this scoring heuristic, and find that it favors disappointing models under certain circumstances. As a remedy, we propose the Fundamental Scoring Network (FSNet), which infers a score for a pair of overlapping images and any proposed fundamental matrix. It does not rely on sparse correspondences, but rather embodies a two-view geometry model through an epipolar attention mechanism that predicts the pose error of the two images. FSNet can be incorporated into traditional RANSAC loops. We evaluate FSNet on fundamental and essential matrix estimation on indoor and outdoor datasets, and establish that FSNet can successfully identify good poses for pairs of images with few or unreliable correspondences. Besides, we show that naively combining FSNet with the MAGSAC++ scoring approach achieves state-of-the-art results.
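As a point of reference, the inlier-count heuristic that the abstract above questions can be sketched as follows. This is the conventional baseline, not FSNet. The function names, the use of the Sampson distance as the error measure and the threshold value are assumptions chosen for the example.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Point = Tuple[float, float]
Match = Tuple[Point, Point]


def _mat_vec(m: Matrix3, v: Sequence[float]) -> list:
    """Compute m @ v for a 3x3 matrix and a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def _mat_t_vec(m: Matrix3, v: Sequence[float]) -> list:
    """Compute m.T @ v for a 3x3 matrix and a 3-vector."""
    return [sum(m[r][c] * v[r] for r in range(3)) for c in range(3)]


def sampson_distance(F: Matrix3, x1: Point, x2: Point) -> float:
    """First-order geometric error of x1 <-> x2 under the constraint x2^T F x1 = 0."""
    p1 = [x1[0], x1[1], 1.0]
    p2 = [x2[0], x2[1], 1.0]
    Fx1 = _mat_vec(F, p1)
    Ftx2 = _mat_t_vec(F, p2)
    num = sum(p2[i] * Fx1[i] for i in range(3)) ** 2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den if den > 0 else float("inf")


def inlier_count(F: Matrix3, matches: Sequence[Match], threshold: float = 1.0) -> int:
    """Score a hypothesis F by counting matches whose error is below the threshold."""
    return sum(1 for x1, x2 in matches if sampson_distance(F, x1, x2) < threshold)


# Hypothetical hypothesis: pure horizontal translation, so matches must share a row (y1 == y2).
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
matches = [((10.0, 20.0), (15.0, 20.0)),   # exact
           ((5.0, 5.0), (9.0, 5.5)),       # slightly noisy
           ((0.0, 0.0), (3.0, 10.0))]      # gross outlier
score = inlier_count(F, matches)
```

Because the score depends entirely on the available matches, it carries little information when correspondences are few or unreliable, which is the regime the abstract targets.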

    Shape knowledge for segmentation and tracking

    The aim of this thesis is to provide methods for 2D segmentation and 2D/3D tracking, that are both fast and robust to imperfect image information, as caused for example by occlusions, motion blur and cluttered background. We do this by combining high level shape information with simultaneous segmentation and tracking. We base our work on the assumption that the space of possible 2D object shapes can be either generated by projecting down known rigid 3D shapes or learned from 2D shape examples. We minimise the discrimination between statistical foreground and background appearance models with respect to the parameters governing the shape generative process (the 6 degree-of-freedom 3D pose of the 3D shape or the parameters of the learned space). The foreground region is delineated by the zero level set of a signed distance function, and we define an energy over this region and its immediate background surroundings based on pixel-wise posterior membership probabilities. We obtain the differentials of this energy with respect to the parameters governing shape and conduct searches for the correct shape using standard non-linear minimisation techniques. This methodology first leads to a novel rigid 3D object tracker. For a known 3D shape, our optimisation here aims to find the 3D pose that leads to the 2D projection that best segments a given image. We extend our approach to track multiple objects from multiple views and propose novel enhancements at the pixel level based on temporal consistency. Finally, owing to the per pixel nature of much of the algorithm, we support our theoretical approach with a real-time GPU based implementation. We next use our rigid 3D tracker in two applications: (i) a driver assistance system, where the tracker is augmented with 2D traffic sign detections, which, unlike previous work, allows for the relevance of the traffic signs to the driver to be gauged and (ii) a robust, real time 3D hand tracker that uses data from an off-the-shelf accelerometer and articulated pose classification results from a multiclass SVM classifier. Finally, we explore deformable 2D/3D object tracking. Unlike previous works, we use a non-linear and probabilistic dimensionality reduction, called Gaussian Process Latent Variable Models, to learn spaces of shape. Segmentation becomes a minimisation of an image-driven energy function in the learned space. We can represent both 2D and 3D shapes which we compress with Fourier-based transforms, to keep inference tractable. We extend this method by learning joint shape-parameter spaces, which, novel to the literature, enable simultaneous segmentation and generic parameter recovery. These can describe anything from 3D articulated pose to eye gaze. We also propose two novel extensions to standard GP-LVM: a method to explore the multimodality in the joint space efficiently, by learning a mapping from the latent space to a space that encodes the similarity between shapes and a method for obtaining faster convergence and greater accuracy by use of a hierarchy of latent embeddings.

    Say yes to the dress: shape and style transfer using conditional GANs

    Objects are defined by their shape and visual style. Previous work into image manipulation has generally altered the stylistic appearance of a whole image, while maintaining the image content and object shapes. In this paper we transfer both the shape and style of chosen objects between images, leaving the remaining areas unaltered. To tackle this problem, we propose a two stage method, where each stage contains a generative adversarial network, that will alter the shape and style of objects in a subject image to reflect a donor image. We demonstrate the effectiveness of our method by transferring clothing between images

    Neighbourhood-insensitive point cloud normal estimation network

    We introduce a novel self-attention-based normal estimation network that is able to focus softly on relevant points and adjust the softness by learning a temperature parameter, making it able to work naturally and effectively within a large neighbourhood range. As a result, our model outperforms all existing normal estimation algorithms by a large margin, achieving 94.1% accuracy in comparison with the previous state of the art of 91.2%, with a 25x smaller model and 12x faster inference time. We also use point-to-plane Iterative Closest Point (ICP) as an application case to show that our normal estimations lead to faster convergence than normal estimations from other methods, without manually fine-tuning neighbourhood range parameters. Code available at https://code.active.vision
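The effect of a temperature on how "softly" attention focuses can be illustrated with a temperature-scaled softmax over per-point scores. This is a minimal sketch, not the authors' network: the function name, the scores and the temperature values are hypothetical, and the temperature is fixed here, whereas the abstract describes it as learned.

```python
import math
from typing import List, Sequence


def soft_attention_weights(scores: Sequence[float], temperature: float) -> List[float]:
    """Softmax over scores divided by a temperature; lower temperature gives sharper focus."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical relevance scores for four neighbouring points.
scores = [2.0, 1.0, 0.1, -1.0]
sharp = soft_attention_weights(scores, temperature=0.1)   # nearly one-hot
soft = soft_attention_weights(scores, temperature=10.0)   # nearly uniform
```

With a low temperature almost all weight goes to the highest-scoring point, while a high temperature spreads the weight across the whole neighbourhood.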

    Using learning of speed to stabilize scale in monocular localization and mapping

    Monocular visual localization and mapping algorithms are able to estimate the environment only up to scale, a degree of freedom which leads to scale drift, difficulty closing loops, and eventual failure. This paper describes an image-driven approach for scale-drift correction which uses a convolutional neural network to infer the speed of the camera from successive monocular video frames. We obtain continuous drift correction, avoiding the need for explicit higher-level representations of the map to resolve scale. We also propose a novel method of including speed estimates as a regularizer in bundle adjustment which avoids the pitfalls of sudden imposition of scale knowledge. We demonstrate our approach using long-distance sequences for which ground truth is available, and find output that is essentially free of scale drift. We compare the performance with a number of other methods for scale-drift correction from monocular data, and show that our solution achieves more accurate results.

    BNV-fusion: dense 3D reconstruction using bi-level neural volume fusion    

    Dense 3D reconstruction from a stream of depth images is the key to many mixed reality and robotic applications. Although methods based on Truncated Signed Distance Function (TSDF) Fusion have advanced the field over the years, the TSDF volume representation is confronted with striking a balance between the robustness to noisy measurements and maintaining the level of detail. We present Bi-level Neural Volume Fusion (BNV-Fusion), which leverages recent advances in neural implicit representations and neural rendering for dense 3D reconstruction. In order to incrementally integrate new depth maps into a global neural implicit representation, we propose a novel bi-level fusion strategy that considers both efficiency and reconstruction quality by design. We evaluate the proposed method on multiple datasets quantitatively and qualitatively, demonstrating a significant improvement over existing methods
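For context, classical TSDF fusion, which the abstract above contrasts against, updates each voxel with a weighted running average of truncated signed distances. The sketch below shows that baseline update for a single voxel only. It is not BNV-Fusion, and the function names, the truncation distance and the unit per-observation weight are assumptions made for illustration.

```python
from typing import Tuple


def truncate(sdf: float, trunc: float) -> float:
    """Normalise a signed distance by the truncation distance and clamp it to [-1, 1]."""
    return max(-1.0, min(1.0, sdf / trunc))


def fuse_voxel(tsdf: float, weight: float, observed_sdf: float,
               trunc: float = 0.05, obs_weight: float = 1.0,
               max_weight: float = 100.0) -> Tuple[float, float]:
    """Blend one new observation into a voxel using a weighted running average."""
    d = truncate(observed_sdf, trunc)
    new_weight = weight + obs_weight
    new_tsdf = (tsdf * weight + d * obs_weight) / new_weight
    # Capping the weight keeps the voxel responsive to later observations.
    return new_tsdf, min(new_weight, max_weight)


# Three hypothetical noisy depth readings (in metres) of the same surface near one voxel.
tsdf, weight = 0.0, 0.0
for measurement in (0.012, 0.008, 0.010):
    tsdf, weight = fuse_voxel(tsdf, weight, measurement)
```

Averaging suppresses measurement noise, but the same averaging also smooths away fine geometry, which is the trade-off the abstract attributes to the TSDF volume representation.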

    Interpolating convolutional neural networks using batch normalization

    Perceiving a visual concept as a mixture of learned ones is natural for humans, aiding them to grasp new concepts and strengthening old ones. For all their power and recent success, deep convolutional networks do not have this ability. Inspired by recent work on universal representations for neural networks, we propose a simple emulation of this mechanism by purposing batch normalization layers to discriminate visual classes, and formulating a way to combine them to solve new tasks. We show that this can be applied for 2-way few-shot learning where we obtain between 4% and 17% better accuracy compared to straightforward full fine-tuning, and demonstrate that it can also be extended to the orthogonal application of style transfer

    Real-time 3D tracking and reconstruction on mobile phones

    We present a novel framework for jointly tracking a camera in 3D and reconstructing the 3D model of an observed object. Due to the region based approach, our formulation can handle untextured objects, partial occlusions, motion blur, dynamic backgrounds and imperfect lighting. Our formulation also allows for a very efficient implementation which achieves real-time performance on a mobile phone, by running the pose estimation and the shape optimisation in parallel. We use a level set based pose estimation but completely avoid the, typically required, explicit computation of a global distance. This leads to tracking rates of more than 100 Hz on a desktop PC and 30 Hz on a mobile phone. Further, we incorporate additional orientation information from the phone's inertial sensor which helps us resolve the tracking ambiguities inherent to region based formulations. The reconstruction step first probabilistically integrates 2D image statistics from selected keyframes into a 3D volume, and then imposes coherency and compactness using a total variational regularisation term. The global optimum of the overall energy function is found using a continuous max-flow algorithm and we show that, similar to tracking, the integration of per voxel posteriors instead of likelihoods improves the precision and accuracy of the reconstruction